Abstract: We consider a problem where mutually untrusting 
curators possess portions of a vertically partitioned database 
containing information about a set of individuals. The goal is 
to enable an authorized party to obtain aggregate (statistical) 
information from the database while protecting the privacy of the 
individuals, which we formalize using Differential Privacy. This 
process can be facilitated by an untrusted server that provides 
storage and processing services but should not learn anything 
about the database. This work describes a data release mech- 
anism that employs Post Randomization (PRAM), encryption 
and random sampling to maintain privacy, while allowing the 
authorized party to conduct an accurate statistical analysis of 
the data. Encryption ensures that the storage server obtains 
no information about the database, while PRAM and sampling 
ensure that individual privacy is maintained against the authorized 
party. We characterize how much the composition of random 
sampling with PRAM increases the differential privacy of the system 
compared to using PRAM alone. We also analyze the statistical 
utility of our system by bounding the estimation error (the 
expected $\ell_2$-norm error between the true empirical distribution 
and the estimated distribution) as a function of the number 
of samples, PRAM noise, and other system parameters. Our 
analysis shows a tradeoff between increasing PRAM noise versus 
decreasing the number of samples to maintain a desired level of 
privacy, and we determine the optimal number of samples that 
balances this tradeoff and maximizes the utility. In experimental 
simulations with the UCI "Adult Data Set" and with synthetically 
generated data, we confirm that the theoretically predicted 
optimal number of samples indeed achieves close to the minimal 
empirical error, and that our analytical error bounds match well 
with the empirical results. 



I. Introduction 

One of the most visible technological trends is the emer- 
gence and proliferation of large-scale data collection. Public 
and private enterprises are collecting tremendous volumes of 
data on individuals, their activities, their preferences, their 
locations, their medical histories, and so on. These enter- 
prises include government organizations, healthcare providers, 
financial institutions, internet search engines, social networks, 
cloud service providers, and many other kinds of private 
companies. Naturally, interested parties could potentially dis- 
cern meaningful patterns and gain valuable insights if they 
were able to access and correlate the information across these 
large, distributed databases. For example, a social scientist 
may want to determine the correlations between individual 
income with personal characteristics such as gender, race, age, 
education, etc., or a medical researcher may want to study 
the relationships between disease prevalence and individual 
environmental factors. In such applications, it is imperative to 
maintain the privacy of individuals, while ensuring that the 
useful aggregate (statistical) information is only revealed to 
the authorized parties. Indeed, unless the public is satisfied 
that their privacy is being preserved, they would not provide 
their consent for the collection and use of their personal 
information. Additionally, the inherent distribution of this 
data across multiple curators presents a significant challenge, 
as privacy concerns and policy would likely prevent these 
curators from directly sharing their data to facilitate statistical 
analysis in a centralized fashion. Thus, tools must be developed 
for conducting statistical analysis on large and distributed 
databases, while addressing these privacy and policy concerns. 

Fig. 1. An example in which curators Alice and Bob hold vertically 
partitioned data, and a sanitized combination of their databases is made 
available for statistical analysis. 

As an example, consider two curators Alice and Bob, 
who possess two databases containing census-type information 
about individuals in a population, as shown in Figure 1. 
Suppose that this data is to be combined and made available 
to authorized researchers studying salaries in the population, 
while ensuring that the privacy of the individual respondents is 
maintained. Conceptually, a data release mechanism involves 
the "sanitization" of the data (via some form of perturbation 
or transform) to preserve individual privacy, before making 
it available for data analysis. The suitability of the method 
used to sanitize the data is determined by the extent to which 
rigorously defined privacy constraints are met. 

Recent research has shown that conventional mechanisms 
for privacy, such as $k$-anonymization [1], [2], do not provide 
adequate privacy. Specifically, an informed adversary can link 
an arbitrary amount of side information to the anonymized 
database, and defeat the anonymization mechanism [3]. In 
response to vulnerabilities of simple anonymization mecha- 
nisms, a stricter notion of privacy — Differential Privacy [4], 
[5] — has been developed in recent years. Informally, differ- 
ential privacy ensures that the result of a function computed on 
a database of respondents is almost insensitive to the presence 
or absence of a particular respondent. A more formal way of 
stating this is that when the function is evaluated on adjacent 
databases (differing in only one respondent), the probability 
of outputting the same result is almost unchanged. 

Mechanisms that provide differential privacy typically in- 
volve output perturbation, e.g., when Laplacian noise is added 
to the result of a function computed on a database, it pro- 
vides differential privacy to the individual respondents in the 
database [6], [7]. Nevertheless, it can be shown that input 
perturbation approaches such as the randomized response 
mechanism [8], [9] - where noise is added to the data 
itself - also provide differential privacy to the respondents. 
In this work, we are interested in a privacy mechanism 
that achieves three goals. Firstly, the mechanism protects the 
privacy of individual respondents in a database. We achieve 
this through a privacy mechanism involving sampling and 
Post Randomization (PRAM) [10], which is a generalization 
of randomized response. Secondly, the mechanism prevents 
unauthorized parties from learning anything about the data. We 
achieve this using random pads which can only be reversed 
by the authorized parties. Thirdly, the mechanism achieves 
a superior tradeoff between privacy and utility compared to 
simply performing PRAM on the database. We show that 
sampling the database enhances privacy with respect to the 
individual respondents while retaining the utility provided to 
an authorized researcher interested in the joint and marginal 
empirical probability distributions. 

The idea of enhancing differential privacy via sampling, 
to the best of our knowledge, first appeared in [6], [11] and 
was further developed by [12]. Theorem 3.2, which we develop 
and prove herein, is analogous to the privacy amplification 
result of Theorem 1 in [12]; however, the theorems are proved 
differently. Specifically, our proof requires an extra and 
non-trivial step because the definition of differential 
privacy and the sampling method in our setting are different. In 
the definition of differential privacy used in [6], [11], [12], 
neighboring or adjacent databases are obtained by adding or 
deleting an entry from the database under consideration. This 
notion of adjacency cannot be used in our setting owing to the 
fact that our setting involves perturbing the input data directly 
using techniques such as PRAM. In our work, an adjacent 
or neighboring database is obtained by replacing (i.e., deleting 
and adding) a single entry in the database under consideration. 
Further, the work in [6], [11], [12] uses sampling with a 
fixed probability of including or excluding a sample, while 
our sampling mechanism is slightly different: the number of 
samples is fixed, and then sampling is carried out uniformly 
and without replacement based on the ratio of the number 
of samples to the size of the original database. This requires 
a different proof technique that considers sets of possible 
samplings. 

The more significant difference with respect to recent work 
is that, unlike [12], we conduct a utility analysis, and derive 
a bound on the accuracy with which the desired statistical 
measures can be estimated, as a function of the noise inserted 
for privacy and the number of samples. Our analysis reveals a 
privacy-utility tradeoff between increasing PRAM noise versus 
decreasing the number of samples to maintain a desired level 
of differential privacy, and we determine the optimal number 
of samples that balances this tradeoff and maximizes the 
utility. We carry out experiments on both real-world and syn- 
thetically generated data which confirm the existence of this 
tradeoff, and reveal that the experimentally obtained optimal 
number of samples is very close to the number predicted by 
our analysis. 

Another related work examines the effect of sampling on 
crowd-blending privacy [13]. This is a strictly relaxed version 
of differential privacy, but it is shown that a pre-sampling step 
applied to a crowd-blending privacy mechanism can achieve 
a desired amount of differential privacy. The scenario in our 
work differs from the treatment in [13] in that we consider 
vertically partitioned distributed databases which are held by 
mutually untrusting curators. In our setting, computing joint 
statistics requires a join operation on the databases, which 
implies that individual curators cannot independently blend 
their respondents without altering the joint statistics across all 
databases. 

The remainder of this paper is organized as follows: 
Section II describes the multiparty problem setting, fixes 
notation and gives the privacy and utility definitions used 
in our analysis. Section III contains our main development, 
and begins by describing the mechanism itself, consisting of 
encryption via random padding, randomized sampling, and 
data perturbation. It is shown that sampling enhances the 
privacy of the individual respondents. An expression is derived 
for the utility function, namely the expected $\ell_2$-norm error in 
the estimate of the joint distribution, in terms of the number of 
samples and the amount of noise introduced by PRAM. More 
importantly, the analysis reveals a tradeoff between the number 
of samples and the perturbation noise. We conclude the section 
by deriving an expression for the optimal number of samples 
needed to maximize the utility function while achieving a 
desired level of privacy. In Section IV, the claims made in 
the theoretical analysis are tested experimentally with the UCI 
"Adult Data Set" [14] and with synthetically generated data. 
In particular, the theoretically predicted optimal number of 
samples, that minimizes the error in the joint distribution, is 
found to agree closely with the experimental results. Finally, 
Section V summarizes the main results and concludes the pa- 
per with a discussion on the practical considerations involved 
in performing privacy-preserving statistical analysis using a 
combination of encryption, sampling and data perturbation. 

II. Problem Formulation 

In this section, we present our general problem setup, 
wherein database curators wish to release data to enable 
privacy-preserving data analysis by an authorized party. For 
ease of exposition, we present our problem formulation and 
results with two data curators, Alice and Bob; however, our 
methods can easily be generalized to more than two curators. 
Consider a data mining application in which Alice and Bob 
are mutually untrusting data curators, as shown in Figure 2. 
The two databases are to be made available for research with 
authorization granted by the data curators, such that statistical 
measures can be computed either on the individual databases, 
or on some combination of the two databases. Data curators 
should have flexible access control over the data. For example, 
if a researcher is granted access by Alice but not by Bob, 
then he/she can only access Alice's data. In addition, the 
cloud server should only host the data and not be able to access 
the information. The data should be sanitized, before being 
released, to protect individual privacy. Altogether, we have 
the following privacy and utility requirements: 

1) Database Security: Only researchers authorized by the 
data curators should be able to extract statistical information 
from the database. 

2) Respondent Privacy: Individual privacy of the respondents 
must be maintained against the cloud server as well 
as the researchers. 

3) Statistical Utility: An authorized researcher, i.e., one 
possessing appropriate keys, should be able to compute 
the joint and marginal distributions of the data provided 
by Alice and Bob. 

4) Complexity: The overall communication and computation 
requirements of the system should be reasonable. 

In the following sections, we will present our system 
framework and formalize the notions of privacy and utility. 

Fig. 2. Curators Alice and Bob independently encrypt their databases and 
provide them to a cloud server. The cloud server will sanitize the joint data. A 
researcher with the decryption key can derive joint statistics or the joint type based on 
the sanitized data, without compromising the privacy of individual database 
respondents. Neither the statistics nor the individual data entries are revealed 
to the cloud server. 

A. Type and Matrix Notation 

The type (or empirical distribution) of a sequence $X^n$ is 
defined as the mapping $T_{X^n} : \mathcal{X} \to [0, 1]$ given by 

$$\forall x \in \mathcal{X}, \quad T_{X^n}(x) := \frac{|\{i : X_i = x\}|}{n}.$$

The joint type of two sequences $X^n$ and $Y^n$ is defined as the 
mapping $T_{X^n,Y^n} : \mathcal{X} \times \mathcal{Y} \to [0, 1]$ given by 

$$\forall (x,y) \in \mathcal{X} \times \mathcal{Y}, \quad T_{X^n,Y^n}(x,y) := \frac{|\{i : (X_i, Y_i) = (x,y)\}|}{n}.$$



For notational convenience, when working with finite-domain 
type/distribution functions, we will drop the arguments 
to represent and use these functions as vectors/matrices. For 
example, we can represent a distribution function $P_X : \mathcal{X} \to [0,1]$ 
as the $|\mathcal{X}| \times 1$ column-vector $P_X$, with its values 
arranged according to a fixed consistent ordering of $\mathcal{X}$. Thus, 
with a slight abuse of notation, using the values of $\mathcal{X}$ to 
index the vector, the "$x$"-th element of the vector, $P_X[x]$, is 
given by $P_X(x)$. Similarly, a conditional distribution function 
$P_{Y|X} : \mathcal{Y} \times \mathcal{X} \to [0, 1]$ can be represented as a $|\mathcal{Y}| \times |\mathcal{X}|$ matrix 
$P_{Y|X}$, defined by $P_{Y|X}[y, x] := P_{Y|X}(y|x)$. For example, by 
utilizing this notation, the elementary probability identity 

$$\forall y \in \mathcal{Y}, \quad P_Y(y) = \sum_{x \in \mathcal{X}} P_{Y|X}(y|x) P_X(x),$$

can be written in matrix form as simply $P_Y = P_{Y|X} P_X$. 
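The type and matrix notation can be illustrated with a small sketch. The toy attribute values and the function names below are hypothetical and chosen only for illustration; exact fractions are used so that the identity $P_Y = P_{Y|X} P_X$ can be checked without rounding.

```python
from collections import Counter
from fractions import Fraction

def joint_type(xs, ys):
    """Empirical joint distribution T_{X^n,Y^n}, as a dict keyed by (x, y)."""
    n = len(xs)
    counts = Counter(zip(xs, ys))
    return {pair: Fraction(c, n) for pair, c in counts.items()}

def marginal_x(t_xy):
    """Marginal type of X, obtained by summing the joint type over y."""
    out = {}
    for (x, _y), p in t_xy.items():
        out[x] = out.get(x, 0) + p
    return out

def marginal_y_via_matrix(t_xy):
    """Compute P_Y(y) = sum_x P_{Y|X}(y|x) P_X(x) entry by entry."""
    p_x = marginal_x(t_xy)
    p_y = {}
    for (x, y), p in t_xy.items():
        p_y_given_x = p / p_x[x]  # entry P_{Y|X}[y, x]
        p_y[y] = p_y.get(y, 0) + p_y_given_x * p_x[x]
    return p_y

# Hypothetical toy data: Alice holds one attribute, Bob holds another.
xs = ["F", "M", "F", "F", "M"]
ys = ["hi", "lo", "lo", "hi", "lo"]
t = joint_type(xs, ys)
```

Pairs that never occur are simply absent from the dictionary, which corresponds to a zero entry in the type vector.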



B. System Framework 

Database Model: The data table held by Alice is modeled 
as a sequence $X^n := (X_1, X_2, \ldots, X_n)$, with each $X_i$ 
taking values in the finite alphabet $\mathcal{X}$. Likewise, Bob's data 
table is modeled as a sequence of random variables 
$Y^n := (Y_1, Y_2, \ldots, Y_n)$, with each $Y_i$ taking values in the finite 
alphabet $\mathcal{Y}$. The length of the sequences, $n$, represents the 
total number of respondents in the database, and each $(X_i, Y_i)$ 
pair represents the data of the respondent $i$ collectively held 
by Alice and Bob, with the alphabet $\mathcal{X} \times \mathcal{Y}$ representing the 
domain of each respondent's data. 

Data Processing and Release: The curators each apply 
a data release mechanism to their respective data tables to 
produce an encryption of their data for the cloud server 
and a decryption key to be relayed to the researcher. These 
mechanisms are denoted by the randomized mappings 
$F_A : \mathcal{X}^n \to \mathcal{O}_A \times \mathcal{K}_A$ and $F_B : \mathcal{Y}^n \to \mathcal{O}_B \times \mathcal{K}_B$, where $\mathcal{K}_A$ 
and $\mathcal{K}_B$ are suitable key spaces, and $\mathcal{O}_A$ and $\mathcal{O}_B$ are suitable 
encryption spaces. The encryptions and keys are produced and 
given by 

$$(O_A, K_A) := F_A(X^n),$$
$$(O_B, K_B) := F_B(Y^n).$$

The encryptions $O_A$ and $O_B$ are sent to the cloud server, 
which performs processing, and the keys $K_A$ and $K_B$ are 
later sent to the researcher. The cloud server processes $O_A$ 
and $O_B$, producing an output $O$ via a random mapping 
$M : \mathcal{O}_A \times \mathcal{O}_B \to \mathcal{O}$, as given by 

$$O := M(O_A, O_B).$$

Statistical Recovery: To enable the recovery of the statistics 
of the database, the processed output $O$ is provided to the 
researcher via the cloud server, and the encryption keys $K_A$ 
and $K_B$ are provided by the curators. The researcher produces 
an estimate of the joint type (empirical distribution) of Alice 
and Bob's sequences, denoted by $\hat{T}_{X^n,Y^n}$, as a function of $O$, 
$K_A$, and $K_B$, as given by 

$$\hat{T}_{X^n,Y^n} := g(O, K_A, K_B),$$

where $g : \mathcal{O} \times \mathcal{K}_A \times \mathcal{K}_B \to [0, 1]^{|\mathcal{X}| \times |\mathcal{Y}|}$ is the reconstruction 
function. 

The objective is to design a system within the above 
framework, by specifying the mappings $F_A$, $F_B$, $M$, and $g$, 
that optimize the system performance requirements, which are 
formulated in the next subsection. 

C. Privacy and Utility Conditions 

In this subsection, we formulate the privacy and utility 
requirements for our problem framework. 

Privacy against the Server: In the course of system 
operation, the data curators do not want to reveal any information 
about their data tables (not even aggregate statistics) to the 
cloud server. A strong statistical condition that guarantees 
this security is the requirement of statistical independence 
between the data tables and the encrypted versions held by the 
server. The statistical requirement of independence guarantees 
security even against an adversarial server with unbounded 
resources, and does not require any unproved assumptions. 

Respondent Privacy: The data pertaining to a respondent 
should be kept private from all other parties, including any 
authorized researchers who aim to recover the statistics. We 
formalize this notion using $\epsilon$-differential privacy for the 
respondents as follows: 

Definition 2.1: [15] Given the above framework, the system 
provides $\epsilon$-differential privacy if for all databases 
$(x^n, y^n)$ and $(\dot{x}^n, \dot{y}^n)$ in $\mathcal{X}^n \times \mathcal{Y}^n$, within Hamming distance 
$d_H((x^n, y^n), (\dot{x}^n, \dot{y}^n)) \le 1$, and all $S \subseteq \mathcal{O} \times \mathcal{K}_A \times \mathcal{K}_B$, 

$$\Pr\left[(O, K_A, K_B) \in S \mid (X^n, Y^n) = (x^n, y^n)\right] \le e^{\epsilon} \Pr\left[(O, K_A, K_B) \in S \mid (X^n, Y^n) = (\dot{x}^n, \dot{y}^n)\right].$$

This rigorous definition of privacy is widely used and sat- 
isfies the privacy axioms of [16], [17]. Under the assumption 
that the respondents' data is i.i.d., this definition results in a 
strong privacy guarantee: an attacker with knowledge of all 
except one of the respondents cannot recover the data of the 
sole missing respondent [18]. 

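Utility for Authorized Researchers: The utility of the 
estimate is measured by the expected $\ell_2$-norm error of this 
estimated type vector, given by 

$$E\left\|\hat{T}_{X^n,Y^n} - T_{X^n,Y^n}\right\|_2 = E\left[\sqrt{\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \left|\hat{T}_{X^n,Y^n}(x,y) - T_{X^n,Y^n}(x,y)\right|^2}\,\right],$$

with the goal being the minimization of this error. 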

System Complexity: The communication and computational 
complexity of the system are also of concern. The 
computational complexity can be captured by the complexity 
of implementing the mappings ($F_A$, $F_B$, $M$, and $g$) that specify 
a given system. Ideally, one should aim to minimize the 
computational complexity of all of these mappings, simplifying the 
operations that each party must perform. The communication 
requirements are given by the cardinalities of the symbol 
alphabets ($\mathcal{O}_A$, $\mathcal{O}_B$, $\mathcal{K}_A$, $\mathcal{K}_B$, and $\mathcal{O}$). The logarithms of these 
alphabet sizes indicate the sufficient length for the messages 
that must be transmitted in this system. 

III. Proposed System and Analysis 

In this section, we will present the details of our system, 
and analyze its privacy and utility performance. First, in Sec- 
tion III-A, we will describe how our system utilizes sampling 
and additive encryption, enabling a cloud server to join and 
perturb encrypted data in order to facilitate the release of 
sanitized data to the researcher. Next, in Section III-B, we 
analyze the privacy of our system and show that sampling 
enhances privacy, thereby reducing the amount of noise that 
must be injected during the perturbation step in order to obtain 
a desired level of privacy. Finally, in Section III-C, we analyze 
the accuracy of the joint type reconstruction, producing a 
bound on the utility as a function of the system parameters, 
viz., the noise added during perturbation, and the sampling 
factor. 

A. System Architecture 

The data sanitization and release procedure is outlined by 
the following steps: 

1) Sampling: The curators randomly sample their data, 
producing shortened sequences. 

2) Encryption: The curators encrypt and send these short- 
ened sequences to the cloud server. 

3) Perturbation: The cloud server combines and perturbs 
the encrypted sequences. 

4) Release: The researcher obtains the sanitized data from 
the server and the encryption keys from the curators, 
allowing the approximate recovery of data statistics. 

A key aspect of these steps is that the encryption and perturba- 
tion schemes are designed such that these operations commute, 
thus allowing the server to essentially perform perturbation 
on the encrypted sequences, and for the authorized researcher 
to subsequently decrypt perturbed data. In this section, we 
describe the details of each step from a theoretical perspective 
by applying mathematical abstractions and assumptions. Later 
on, we will discuss practical implementations towards 
realizing this system. The overall data sanitization process is 
illustrated in Figure 3. 

Sampling: The data curators reduce their length-$n$ database 
sequences $(X^n, Y^n)$ to $m$ randomly drawn samples. We assume 
that these samples are drawn uniformly without replacement 
and that the curators will both sample at the same locations. 
We will let $(\tilde{X}^m, \tilde{Y}^m) := (\tilde{X}_1, \ldots, \tilde{X}_m, \tilde{Y}_1, \ldots, \tilde{Y}_m)$ 
denote the intermediate result after sampling. Mathematically, 
the sampling result is described by, for all $i$ in $\{1, \ldots, m\}$, 

$$(\tilde{X}_i, \tilde{Y}_i) = (X_{I_i}, Y_{I_i}),$$

where $I_1, \ldots, I_m$ are drawn uniformly without replacement 
from $\{1, \ldots, n\}$. 
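A minimal sketch of the sampling step follows. How the two curators agree on the sampling locations is not specified here, so the shared seed below is an assumption made only for illustration; the function names and toy data are hypothetical, and locations are 0-indexed rather than drawn from $\{1, \ldots, n\}$.

```python
import random

def sample_locations(n, m, rng):
    """Draw m distinct locations uniformly without replacement from {0, ..., n-1}."""
    return rng.sample(range(n), m)

def take(seq, locations):
    """Return the entries of seq at the sampled locations, in sampled order."""
    return [seq[i] for i in locations]

# Hypothetical toy databases with n = 6 respondents.
alice = ["a0", "a1", "a2", "a3", "a4", "a5"]
bob = ["b0", "b1", "b2", "b3", "b4", "b5"]

# Assumption: both curators obtain the same locations, here via a shared seed.
locations = sample_locations(len(alice), 3, random.Random(2024))
alice_sampled = take(alice, locations)
bob_sampled = take(bob, locations)
```

Because both curators index with the same `locations`, the i-th sampled entries of Alice and Bob still belong to the same respondent, which is what the later join at the server relies on.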

Fig. 3. Curators Alice and Bob independently encrypt their databases with a 
one time pad and provide it to a cloud server. The server samples m respondents 
and then performs PRAM to guarantee privacy of the individual database 
respondents. A researcher can derive joint statistics or joint type based on the 
sanitized data, without compromising the privacy of the respondents. Neither 
the statistics nor the individual data entries are revealed to the cloud server. 

Encryption: The data curators independently encrypt their 
sampled data sequences with an additive (one-time pad) 
encryption scheme. To encrypt her data, Alice chooses an 
independent uniform key sequence $V^m \in \mathcal{X}^m$, and produces 
the encrypted sequence 

$$\bar{X}^m := \tilde{X}^m \oplus V^m := (\tilde{X}_1 + V_1, \ldots, \tilde{X}_m + V_m),$$

where $\oplus$ denotes addition² applied element-by-element over 
the sequences. Similarly, Bob encrypts his data, with the 
independent uniform key sequence $W^m \in \mathcal{Y}^m$, to produce 
the encrypted sequence 

$$\bar{Y}^m := \tilde{Y}^m \oplus W^m := (\tilde{Y}_1 + W_1, \ldots, \tilde{Y}_m + W_m).$$

Alice and Bob send these encrypted sequences to the cloud 
server, and will provide the keys to the researcher to enable 
data release. 
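The additive one-time pad can be sketched as below, under the assumption that each symbol is encoded as an integer in $\{0, \ldots, |\mathcal{X}|-1\}$ and that the group addition is addition modulo the alphabet size. Both choices, along with the function names, are illustrative only; any suitably defined group addition would serve.

```python
import secrets

def otp_encrypt(symbols, alphabet_size):
    """Add an independent uniform key symbol to each data symbol (mod alphabet size)."""
    key = [secrets.randbelow(alphabet_size) for _ in symbols]
    cipher = [(s + k) % alphabet_size for s, k in zip(symbols, key)]
    return cipher, key

def otp_decrypt(cipher, key, alphabet_size):
    """Subtract the key symbol-by-symbol to undo the encryption."""
    return [(c - k) % alphabet_size for c, k in zip(cipher, key)]

# Hypothetical sampled sequence over an alphabet of size 4.
sampled = [0, 3, 1, 1, 2]
cipher, key = otp_encrypt(sampled, 4)
```

Since each key symbol is uniform and independent of the data, each ciphertext symbol is uniform regardless of the plaintext, which is the property behind the independence requirement for privacy against the server.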

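Perturbation: The cloud server joins the encrypted data 
sequences, forming $((\bar{X}_1, \bar{Y}_1), \ldots, (\bar{X}_m, \bar{Y}_m))$, and perturbs 
them by applying an independent PRAM mechanism, producing 
the perturbed results $(\hat{\bar{X}}^m, \hat{\bar{Y}}^m)$. Each joined and 
encrypted sample, $(\bar{X}_i, \bar{Y}_i)$, is perturbed independently and 
identically according to a conditional distribution, $P_{\hat{\bar{X}},\hat{\bar{Y}}|\bar{X},\bar{Y}}$, 
that specifies a random mapping from $(\mathcal{X} \times \mathcal{Y})$ to $(\mathcal{X} \times \mathcal{Y})$. 
Using the matrix $A := P_{\hat{\bar{X}},\hat{\bar{Y}}|\bar{X},\bar{Y}}$ to represent the conditional 
distribution, this operation can be described by 

$$P_{\hat{\bar{X}}^m,\hat{\bar{Y}}^m|\bar{X}^m,\bar{Y}^m}(\hat{\bar{x}}^m, \hat{\bar{y}}^m \mid \bar{x}^m, \bar{y}^m) = \prod_{i=1}^{m} A[(\hat{\bar{x}}_i, \hat{\bar{y}}_i), (\bar{x}_i, \bar{y}_i)].$$

By design, we specify that $A$ is a $\gamma$-diagonal matrix, for a 
parameter $\gamma > 1$, given by 

$$A[(x,y),(x',y')] := \begin{cases} \gamma/q, & \text{if } (x,y) = (x',y'), \\ 1/q, & \text{o.w.,} \end{cases}$$

where $q := (\gamma + |\mathcal{X}||\mathcal{Y}| - 1)$ is a normalizing constant. 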
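The $\gamma$-diagonal PRAM step can be sketched as follows. It is assumed here, for illustration only, that each joint symbol $(x, y)$ is encoded as a single integer index in $\{0, \ldots, |\mathcal{X}||\mathcal{Y}|-1\}$. The sampler keeps a symbol with probability $(\gamma - 1)/q$ and otherwise replaces it by a uniformly drawn symbol, which gives probability $\gamma/q$ of staying put and $1/q$ for every other symbol.

```python
import random

def pram_matrix(gamma, size):
    """gamma-diagonal matrix: gamma/q on the diagonal and 1/q elsewhere."""
    q = gamma + size - 1
    return [[(gamma if r == c else 1.0) / q for c in range(size)]
            for r in range(size)]

def pram_perturb(symbols, gamma, size, rng):
    """Perturb each symbol independently according to the gamma-diagonal matrix."""
    q = gamma + size - 1
    out = []
    for s in symbols:
        if rng.random() < (gamma - 1) / q:
            out.append(s)                    # keep the symbol
        else:
            out.append(rng.randrange(size))  # uniform draw, may equal s again
    return out

A = pram_matrix(3.0, 4)
perturbed = pram_perturb([0, 1, 2, 3, 0, 0], 3.0, 4, random.Random(1))
```

A larger `gamma` concentrates more mass on the diagonal, so it corresponds to less perturbation noise.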

² The addition operation can be any suitably defined group addition 
operation over the finite set $\mathcal{X}$. 

Release: In order to recover the data statistics, the researcher 
obtains the sampled, encrypted, and perturbed data 
sequences, $(\hat{\bar{X}}^m, \hat{\bar{Y}}^m)$, from the cloud server, and the encryption 
keys, $V^m$ and $W^m$, from the curators. The researcher 
decrypts and recovers the sanitized data given by 

$$(\hat{X}^m, \hat{Y}^m) := (\hat{\bar{X}}^m \ominus V^m, \hat{\bar{Y}}^m \ominus W^m),$$

where $\ominus$ denotes element-by-element subtraction, the inverse of $\oplus$. 
This is effectively the data sanitized by sampling and PRAM 
(see Lemma 3.1 below). The researcher produces the joint type 
estimate by inverting the matrix $A$ and multiplying it with the 
joint type of the sanitized data as follows 

$$\hat{T}_{X^n,Y^n} := A^{-1} T_{\hat{X}^m,\hat{Y}^m}.$$
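For a $\gamma$-diagonal matrix, the inversion in the estimate can be carried out in closed form. The sketch below relies on the observation that $A t = ((\gamma - 1)t + \mathbf{1})/q$ for any vector $t$ whose entries sum to one, so that $t = (q \cdot At - \mathbf{1})/(\gamma - 1)$. This closed form is our own rearrangement for illustration, and the function names are hypothetical.

```python
def expected_sanitized_type(true_type, gamma):
    """Apply the gamma-diagonal matrix A to a type vector that sums to one."""
    q = gamma + len(true_type) - 1
    return [((gamma - 1.0) * p + 1.0) / q for p in true_type]

def estimate_type(sanitized_type, gamma):
    """Apply A^{-1} to a sanitized type vector that sums to one."""
    q = gamma + len(sanitized_type) - 1
    return [(q * p - 1.0) / (gamma - 1.0) for p in sanitized_type]

true_type = [0.5, 0.3, 0.2, 0.0]
recovered = estimate_type(expected_sanitized_type(true_type, 3.0), 3.0)
```

When applied to the type of an actual perturbed sample rather than its expectation, individual entries of the estimate can fall slightly below zero; the sketch does not clip or renormalize them.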






Due to the $\gamma$-diagonal property of $A$, the PRAM perturbation 
is essentially an additive operation that commutes with 
the additive encryption. This allows the server to perturb the 
encrypted data, with the perturbation being preserved when the 
encryption is removed. The following lemma summarizes this 
property, by stating that the decrypted, sanitized data recovered 
by the researcher, $(\hat{X}^m, \hat{Y}^m)$, is essentially the sampled data 
perturbed by PRAM. 

Lemma 3.1: Given the system described above, we have 
that 

$$P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \tilde{x}^m, \tilde{y}^m) = \prod_{i=1}^{m} A[(\hat{x}_i, \hat{y}_i), (\tilde{x}_i, \tilde{y}_i)].$$
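The commuting property can be checked concretely by writing the $\gamma$-diagonal perturbation as additive noise: the noise pair is $(0, 0)$ with probability $\gamma/q$ and any other pair with probability $1/q$. The sketch below assumes symbols are integers and that the group operation is componentwise modular addition; these choices and all names are illustrative assumptions.

```python
import random

def draw_noise(gamma, nx, ny, rng):
    """Noise pair: (0, 0) with probability gamma/q, each other pair with 1/q."""
    q = gamma + nx * ny - 1
    if rng.random() < (gamma - 1) / q:
        return (0, 0)
    return (rng.randrange(nx), rng.randrange(ny))  # uniform over all pairs

def add(pair, delta, nx, ny):
    return ((pair[0] + delta[0]) % nx, (pair[1] + delta[1]) % ny)

def sub(pair, delta, nx, ny):
    return ((pair[0] - delta[0]) % nx, (pair[1] - delta[1]) % ny)

rng = random.Random(7)
nx, ny, gamma = 3, 4, 5.0
data = [(0, 1), (2, 3), (1, 0), (2, 2)]                       # sampled pairs
keys = [(rng.randrange(nx), rng.randrange(ny)) for _ in data]  # (V_i, W_i)
noise = [draw_noise(gamma, nx, ny, rng) for _ in data]

encrypted = [add(d, k, nx, ny) for d, k in zip(data, keys)]
perturbed_encrypted = [add(e, z, nx, ny) for e, z in zip(encrypted, noise)]
decrypted = [sub(p, k, nx, ny) for p, k in zip(perturbed_encrypted, keys)]

# Perturbing the plain data with the same noise gives the same result.
perturbed_plain = [add(d, z, nx, ny) for d, z in zip(data, noise)]
```

For every noise realization, decrypting the perturbed ciphertext yields exactly the perturbed plaintext, so the two have the same distribution.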



B. Sampling Enhances Privacy 

In this subsection, we will analyze the privacy of our 
proposed system. Specifically, we show how sampling in 
conjunction with PRAM enhances the overall privacy for the 
respondents in comparison to using PRAM alone. Note that if 
PRAM, with the $\gamma$-diagonal matrix $A$, were applied alone to the 
full databases, the resulting perturbed data would have 
$\ln(\gamma)$-differential privacy. In the following theorem, we will show that 
the combination of sampling and PRAM results in sampled 
and perturbed data with enhanced privacy. 

Theorem 3.2: The proposed system provides $\epsilon$-differential 
privacy for the respondents, where 

$$\epsilon = \ln\left(\frac{n + m(\gamma - 1)}{n}\right). \tag{1}$$
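As a quick numerical illustration of equation (1), the following sketch evaluates the privacy level for hypothetical parameter values and compares it with the $\ln(\gamma)$ level of PRAM applied alone, which corresponds to $m = n$.

```python
import math

def epsilon_with_sampling(n, m, gamma):
    """Privacy level from equation (1): ln((n + m (gamma - 1)) / n)."""
    return math.log((n + m * (gamma - 1)) / n)

# Hypothetical parameters: n = 1000 respondents, gamma = 11.
eps_sampled = epsilon_with_sampling(1000, 100, 11.0)   # sample 10% of the database
eps_full = epsilon_with_sampling(1000, 1000, 11.0)     # no sampling gain (m = n)
```

With these values, sampling one tenth of the database lowers the privacy parameter from ln(11) to ln(2) at the same PRAM noise level.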



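Proof: The researcher receives the perturbed and encrypted 
data from the server, $O := (\hat{\bar{X}}^m, \hat{\bar{Y}}^m)$, and the keys 
$(K_A, K_B) := (V^m, W^m)$ from the curators. However, since 
the sanitized data, $(\hat{X}^m, \hat{Y}^m)$, recovered by the researcher 
is a sufficient statistic for the original databases, that is, the 
following Markov chain holds 

$$(X^n, Y^n) - (\hat{X}^m, \hat{Y}^m) - (\hat{\bar{X}}^m, \hat{\bar{Y}}^m, V^m, W^m),$$

we need only to show that, for all $(x^n, y^n)$ and $(\dot{x}^n, \dot{y}^n)$ in 
$\mathcal{X}^n \times \mathcal{Y}^n$ with $d_H((x^n, y^n), (\dot{x}^n, \dot{y}^n)) = 1$, and all 
$(\hat{x}^m, \hat{y}^m)$ in $\mathcal{X}^m \times \mathcal{Y}^m$, 

$$\frac{P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid x^n, y^n)}{P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid \dot{x}^n, \dot{y}^n)} \le e^{\epsilon},$$

in order to prove $\epsilon$-differential privacy for the respondents. 
Since $d_H((x^n, y^n), (\dot{x}^n, \dot{y}^n)) = 1$, the two databases differ in 
only one location. Let $k$ denote the location where 
$(x_k, y_k) \ne (\dot{x}_k, \dot{y}_k)$. 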



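Before we proceed, we introduce some notation regarding 
sampling to facilitate the steps of our proof. We will use the 
following notation for the set of all possible samplings 

$$\Theta := \{\pi \mid \pi := (\pi_1, \ldots, \pi_m) \in \{1, \ldots, n\}^m, \ \pi_i \ne \pi_j \ \forall i \ne j\}.$$

The sampling locations $(I_1, \ldots, I_m)$ are uniformly drawn from 
the set $\Theta$. We also define $\Theta_k := \{\pi \in \Theta \mid \exists i, \pi_i = k\}$ 
to denote the subset of samplings that select location $k$, 
and $\Theta_k^c := \Theta \setminus \Theta_k$ to denote the subset of samplings 
that do not select location $k$. For $\pi \in \Theta_k$, we define 
$\Theta_k(\pi) := \{\pi' \in \Theta_k^c \mid d_H(\pi, \pi') = 1\}$ as the subset of $\Theta_k^c$ 
that replaces the selection of location $k$ with any other 
non-selected location. We will also slightly abuse notation by using 
$\pi \in \Theta$ as a sampling function for the database sequences, that is, 
$\pi(X^n) := (X_{\pi_1}, \ldots, X_{\pi_m})$, and similarly for $\pi(Y^n)$. Using 
the above notation, we can rewrite the following conditional 
probability, 

$$\begin{aligned}
P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid x^n, y^n)
&= \frac{1}{|\Theta|} \sum_{\pi \in \Theta} P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n)) \\
&= \frac{1}{|\Theta|} \Big[ \sum_{\pi \in \Theta_k} P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n)) + \sum_{\pi' \in \Theta_k^c} P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi'(x^n), \pi'(y^n)) \Big] \\
&= \frac{1}{|\Theta|} \sum_{\pi \in \Theta_k} \Big[ P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n)) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi'(x^n), \pi'(y^n)) \Big],
\end{aligned}$$

where in the last equality we have rearranged the summations 
to embed the summation over $\pi' \in \Theta_k^c$ into the summation 
over $\pi \in \Theta_k$. Note that summing over all $\pi' \in \Theta_k(\pi)$ within a 
summation over all $\pi \in \Theta_k$ covers all $\pi' \in \Theta_k^c$, but overcounts 
each $\pi'$ exactly $m$ times since each $\pi' \in \Theta_k^c$ belongs to $m$ of 
the $\Theta_k(\pi)$ sets across all $\pi \in \Theta_k$. Hence, a $(1/m)$ term has 
been added to account for this overcount. 

To ease the use of the above expansion, we introduce the 
following shorthand notation for the summation terms, 

$$\alpha(\pi) := P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n)),$$
$$\beta(\pi) := P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(\dot{x}^n), \pi(\dot{y}^n)).$$

Thus, the following probability ratio can be written as 

$$\begin{aligned}
\frac{P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid x^n, y^n)}{P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid \dot{x}^n, \dot{y}^n)}
&= \frac{\sum_{\pi \in \Theta_k} \big( \alpha(\pi) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} \alpha(\pi') \big)}{\sum_{\pi \in \Theta_k} \big( \beta(\pi) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} \beta(\pi') \big)} \\
&\le \max_{\pi \in \Theta_k} \frac{\alpha(\pi) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} \alpha(\pi')}{\beta(\pi) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} \beta(\pi')}
= \frac{\alpha(\pi^*) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi^*)} \alpha(\pi')}{\beta(\pi^*) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi^*)} \beta(\pi')},
\end{aligned}$$

where $\pi^* \in \Theta_k$ denotes the sampling that maximizes the ratio. 
Given the $\gamma$-diagonal structure of the matrix $A$, we have that 

$$\gamma^{-1} \alpha(\pi^*) \le \beta(\pi^*),$$

since $(\pi^*(x^n), \pi^*(y^n))$ and $(\pi^*(\dot{x}^n), \pi^*(\dot{y}^n))$ differ in only 
one location, 

$$\gamma^{-1} \alpha(\pi^*) \le \alpha(\pi'), \quad \forall \pi' \in \Theta_k(\pi^*),$$

since $(\pi^*(x^n), \pi^*(y^n))$ and $(\pi'(x^n), \pi'(y^n))$ differ in only one 
location, and 

$$\alpha(\pi') = \beta(\pi'), \quad \forall \pi' \in \Theta_k(\pi^*),$$

since $(\pi'(x^n), \pi'(y^n)) = (\pi'(\dot{x}^n), \pi'(\dot{y}^n))$. Given these 
constraints, and noting that $|\Theta_k(\pi^*)| = n - m$, we can continue 
to bound the likelihood ratio as 

$$\begin{aligned}
\frac{\alpha(\pi^*) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi^*)} \alpha(\pi')}{\beta(\pi^*) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi^*)} \beta(\pi')}
&\le \frac{\alpha(\pi^*) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi^*)} \alpha(\pi')}{\gamma^{-1} \alpha(\pi^*) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi^*)} \alpha(\pi')} \\
&\le \frac{\alpha(\pi^*) + \frac{n-m}{m} \gamma^{-1} \alpha(\pi^*)}{\gamma^{-1} \alpha(\pi^*) + \frac{n-m}{m} \gamma^{-1} \alpha(\pi^*)}
= \frac{n + m(\gamma - 1)}{n} = e^{\epsilon},
\end{aligned}$$

where the second inequality holds because the ratio is non-increasing 
in the common summation term, which is at least 
$\frac{n-m}{m} \gamma^{-1} \alpha(\pi^*)$, thus finishing the proof by bounding 
the likelihood ratio with $e^{\epsilon}$. ■ 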

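To show $\epsilon$-differential privacy for a given $\epsilon$, we only need 
to upper bound the probability ratio by $e^{\epsilon}$, as done in the above 
proof. A natural question is if this bound is tight, that is, 
whether there exists a smaller $\epsilon$ for which the bound also holds, 
hence making the system more private. With the following 
example, we show that the value for $\epsilon$ given in Theorem 3.2 
is tight. 

Example 3.3: Let $a$ and $b$ be two distinct elements in 
$(\mathcal{X} \times \mathcal{Y})$. Let $(x^n, y^n) = (b, a, a, \ldots, a)$, $(\dot{x}^n, \dot{y}^n) = (a, a, \ldots, a)$ 
and $(\hat{x}^m, \hat{y}^m) = (b, b, \ldots, b)$. Let $E$ denote the event that 
the first element (where the two databases differ) is sampled, 
which occurs with probability $(m/n)$. We can determine the 
likelihood ratio as follows 

$$\frac{P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid x^n, y^n)}{P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid \dot{x}^n, \dot{y}^n)} = \frac{\frac{m}{n} \cdot \frac{\gamma}{q} \left(\frac{1}{q}\right)^{m-1} + \left(1 - \frac{m}{n}\right) \left(\frac{1}{q}\right)^{m}}{\left(\frac{1}{q}\right)^{m}} = \frac{n + m(\gamma - 1)}{n} = e^{\epsilon}.$$

Thus, the value of $\epsilon$ given by Theorem 3.2 is tight. 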
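The likelihood ratio for the databases in Example 3.3 can also be evaluated exactly for concrete numbers. The sketch below conditions on whether the differing location is sampled; the parameter values are hypothetical, `size` stands for $|\mathcal{X}||\mathcal{Y}|$, and an integer `gamma` is assumed so that exact fractions can be used.

```python
from fractions import Fraction

def example_ratio(n, m, gamma, size):
    """Exact likelihood ratio for the all-b output in the setting of Example 3.3."""
    q = gamma + size - 1
    p_sampled = Fraction(m, n)   # probability that the differing entry is sampled
    stay = Fraction(gamma, q)    # PRAM keeps a symbol
    move = Fraction(1, q)        # PRAM maps a symbol to one specific other symbol
    # Database (b, a, ..., a): one b and m-1 a's if sampled, otherwise m a's.
    numerator = p_sampled * stay * move ** (m - 1) + (1 - p_sampled) * move ** m
    # Database (a, a, ..., a): every sample is a and must be mapped to b.
    denominator = move ** m
    return numerator / denominator

ratio = example_ratio(10, 4, 3, 6)
```

For these values the ratio equals (n + m(γ - 1))/n = 18/10, matching the bound of Theorem 3.2 with equality.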

As a consequence of the privacy analysis of Theorem 3.2, 
we have that for given system parameters of database length $n$, 
number of samples $m$, and desired level of privacy $\epsilon$, the level 
of PRAM perturbation, specified by the $\gamma$ parameter of the 
matrix $A$, must be 

$$\gamma = 1 + \frac{n}{m}\left(e^{\epsilon} - 1\right). \tag{2}$$
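Equation (2) makes the tradeoff between the number of samples and the PRAM noise explicit: for a fixed privacy level, fewer samples permit a larger $\gamma$, that is, less perturbation per sample. A small sketch with hypothetical parameter values:

```python
import math

def gamma_for_privacy(n, m, eps):
    """PRAM parameter from equation (2): 1 + (n / m) (e^eps - 1)."""
    return 1.0 + (n / m) * (math.exp(eps) - 1.0)

eps = math.log(2)
gamma_100 = gamma_for_privacy(1000, 100, eps)   # 10% of the database sampled
gamma_50 = gamma_for_privacy(1000, 50, eps)     # 5% of the database sampled
gamma_all = gamma_for_privacy(1000, 1000, eps)  # no sampling: gamma = e^eps
```

With n = 1000 and ε = ln 2, halving the sample count from 100 to 50 raises γ from about 11 to about 21, while using the full database forces γ = 2.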



Privacy against the server is obtained as a consequence 
of the one-time-pad encryption performed on the data prior 
to transmission to the server. It is straightforward to verify 



that the encryptions received by the server are statistically 
independent of the original database as a consequence of the 
independence and uniform randomness of the keys. 



Applying the smoothing theorem, the sampling error can be 
bounded by 



C. Utility Analysis 

In this subsection, we will analyze the utility of our pro- 
posed system. Our main result is a theoretical bound on the 
expected ^2 -norm of the joint type estimation error. Analysis 
of this bound will illustrate the tradeoffs between utility and 
privacy level e as function of sampling parameter m and 
PRAM perturbation level 7. Given this error bound, we can 
compute the optimal sampling parameter m for minimizing 
the error bound while achieving a fixed privacy level e. 

Theorem 3.4: For our proposed system, the expected £2- 
norm of the joint type estimate is bounded by 
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where c is the condition number of the 7-diagonal matrix A, 
given by 
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Proof: The expected £2 -norm error is given by 
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Applying the triangle inequality, we can bound the error as 
the sum of the error introduced by sampling and the error 
introduced by PRAM, as follows. 



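$$E\left\|A^{-1}T_{\hat{X}^m,\hat{Y}^m} - T_{X^n,Y^n}\right\|_2 \le E\left\|A^{-1}T_{\hat{X}^m,\hat{Y}^m} - T_{\tilde{X}^m,\tilde{Y}^m}\right\|_2 + E\left\|T_{\tilde{X}^m,\tilde{Y}^m} - T_{X^n,Y^n}\right\|_2 .$$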



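We will analyze and bound the sampling error, 

$$E\left\|T_{\tilde{X}^m,\tilde{Y}^m} - T_{X^n,Y^n}\right\|_2 ,$$

by utilizing the smoothing theorem, first bounding the 
conditional expectation 

$$E\left[\left\|T_{\tilde{X}^m,\tilde{Y}^m} - T_{X^n,Y^n}\right\|_2 \,\middle|\, T_{X^n,Y^n}\right].$$

For a given $(x, y) \in \mathcal{X} \times \mathcal{Y}$, the sampled type, $T_{\tilde{X}^m,\tilde{Y}^m}(x, y)$, 
conditioned on $T_{X^n,Y^n}$, is a hypergeometric random variable 
normalized by $m$, with expectation and variance given by 

$$E\left[T_{\tilde{X}^m,\tilde{Y}^m}(x,y) \,\middle|\, T_{X^n,Y^n}\right] = T_{X^n,Y^n}(x,y),$$

$$\mathrm{Var}\left[T_{\tilde{X}^m,\tilde{Y}^m}(x,y) \,\middle|\, T_{X^n,Y^n}\right] = \frac{n-m}{m(n-1)}\, T_{X^n,Y^n}(x,y)\left(1 - T_{X^n,Y^n}(x,y)\right).$$

Applying Jensen's inequality to the conditioned sampling 
error yields 

Next, to analyze and bound the PRAM error given by 

$$E\left\|A^{-1}T_{\hat{X}^m,\hat{Y}^m} - T_{\tilde{X}^m,\tilde{Y}^m}\right\|_2 ,$$

we will make use of the following linear algebra lemma. 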

Lemma 3.5: Let $A$ be an invertible matrix and $(x, y)$ be 
vectors that satisfy $Ax = y$. For any vectors $(\hat{x}, \hat{y})$ such that 
$\hat{x} = A^{-1}\hat{y}$, we have 

$$\frac{\|\hat{x} - x\|}{\|x\|} \le c\,\frac{\|\hat{y} - y\|}{\|y\|},$$

where $c$ is the condition number of the matrix $A$. 
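This is the standard perturbation bound for linear systems. One way to see it, assuming a vector norm together with its induced matrix norm so that $c = \|A\|\,\|A^{-1}\|$ (the choice of norm is our assumption here), is the following short derivation.

```latex
\begin{align*}
\|\hat{x} - x\| &= \|A^{-1}(\hat{y} - y)\| \le \|A^{-1}\|\,\|\hat{y} - y\|, \\
\|y\| &= \|Ax\| \le \|A\|\,\|x\|
  \quad\Longrightarrow\quad \frac{1}{\|x\|} \le \frac{\|A\|}{\|y\|}, \\
\frac{\|\hat{x} - x\|}{\|x\|}
  &\le \|A\|\,\|A^{-1}\|\,\frac{\|\hat{y} - y\|}{\|y\|}
   = c\,\frac{\|\hat{y} - y\|}{\|y\|}.
\end{align*}
```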

To bound the PRAM error, we will make use of the 
following consequence of this lemma. 

which allows us to bound the conditional expectation of the 
PRAM error as follows. 

For a given $(x, y) \in \mathcal{X} \times \mathcal{Y}$, the perturbed and sampled 
type, $T_{\hat{X}^m,\hat{Y}^m}(x, y)$, conditioned on $T_{\tilde{X}^m,\tilde{Y}^m}$, is a Poisson-binomial 
random variable normalized by $m$, with expectation 
and variance given by 



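$$E\left[T_{\hat{X}^m,\hat{Y}^m}(x,y) \,\middle|\, T_{\tilde{X}^m,\tilde{Y}^m}\right] = \left(A\,T_{\tilde{X}^m,\tilde{Y}^m}\right)(x,y),$$

$$\mathrm{Var}\left[T_{\hat{X}^m,\hat{Y}^m}(x,y) \,\middle|\, T_{\tilde{X}^m,\tilde{Y}^m}\right] = \frac{1}{m}\sum_{(x',y')} T_{\tilde{X}^m,\tilde{Y}^m}(x',y')\,A[(x,y),(x',y')]\left(1 - A[(x,y),(x',y')]\right).$$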



We can bound the following conditional expectation using 
Jensen's inequality to yield 

Combining equations yields the bound 

which, upon applying the smoothing theorem, yields the 
following bound on the PRAM error 

Combining the individual bounds on the sampling and 
PRAM error via the triangle inequality yields the following 
bound on the expected $\ell_2$-norm error of the type estimate formed 
from the sampled and perturbed data. 

Since $A$ is a $\gamma$-diagonal matrix, its condition number $c$ is 
given by 

$$c = 1 + \frac{|\mathcal{X}||\mathcal{Y}|}{\gamma - 1}.$$

Fig. 4. Authorized researchers apply decryption keys $V^m$ and $W^m$ from 
Alice and Bob to decrypt the message and obtain the perturbed samples. They 
then use the inverse of the PRAM matrix to estimate the true type. 

Given a fixed PRAM perturbation parameter $\gamma$, the error 
bound decays on the order of $O(1/\sqrt{m})$ as a function of 
the sampling parameter $m$. However, as $m$ increases, $\epsilon$ as 
given in Equation (1) also grows, decreasing privacy. If we instead 
fix the overall privacy level $\epsilon$ by adjusting $\gamma$ as a 
function of $m$, as given by Equation (2), we observe that increasing $m$ 
too much causes the error bound to grow. Intuitively, 
when $m$ is too large, we need 
to increase the PRAM perturbation by lowering $\gamma$ to 
maintain the same level of privacy, which has the adverse 
effect of increasing the error bound through the condition 
number $c$. On the other hand, when $m$ is too small, too few 
samples are taken, resulting in an inaccurate type estimate. 
This balance in adjusting the sampling parameter $m$ shows 
that there is an optimal sample size $m$ as a function of the 
desired level of privacy $\epsilon$ and other system parameters. The 
theoretically optimal sample size $m$ for the error upper bound 
is given by the following corollary. 
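The effect of $m$ on the condition number can be illustrated numerically. The sketch below assumes that a $\gamma$-diagonal matrix has diagonal entries $\gamma/(q-1+\gamma)$ and off-diagonal entries $1/(q-1+\gamma)$, with $q = |\mathcal{X}||\mathcal{Y}|$; this definition is given outside the present section, so it is an assumption here, and all function names are illustrative.

```python
import math


def gamma_diagonal(q, gamma):
    """q x q matrix with diagonal gamma/(q-1+gamma), off-diagonal 1/(q-1+gamma).

    Assumed definition of a gamma-diagonal PRAM matrix (not restated in this section).
    """
    s = 1.0 / (q - 1 + gamma)
    return [[(gamma if i == j else 1.0) * s for j in range(q)] for i in range(q)]


def condition_number(q, gamma):
    """2-norm condition number of the symmetric matrix above.

    Its largest eigenvalue is 1 (all-ones eigenvector) and its smallest is
    (gamma - 1)/(q - 1 + gamma) (eigenvectors summing to zero).
    """
    return (q - 1 + gamma) / (gamma - 1.0)


def gamma_for(n, m, eps):
    """Equation (2): gamma needed for privacy level eps with m of n samples."""
    return 1.0 + (n / m) * math.expm1(eps)


if __name__ == "__main__":
    n, q, eps = 45222, 24, 0.5
    for m in (500, 2000, 8000, 32000):
        g = gamma_for(n, m, eps)
        print(m, round(g, 2), round(condition_number(q, g), 3))
```

Under these assumptions, holding $\epsilon$ fixed while increasing $m$ lowers $\gamma$ and raises $c$, which is the mechanism behind the growth of the error bound for large $m$.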

Corollary 3.6: The optimal sampling parameter $m^*$ that 
minimizes the error bound of Equation (3) is 

Proof: By combining the equations for the expected error 
bound, Equation (3), and the required level of $\gamma$, Equation (2), 
we have 

By setting the derivative of this expression to zero, we can 
solve to find the optimal $m$, 

IV. Experimental Results 

In order to validate our theoretical results, we conducted 
experiments that simulated our proposed system using the UCI 
"Adult Data Set" [14] and synthetically generated data. The 
UCI "Adult Data Set" was extracted from the 1994 Census 
database and consists of personal information for over 48 
thousand individuals, with various attributes including age, 
education, marital status, occupation, gender, race, income, 
etc. 

For the first set of experiments, we reduced the cardinality of 
the attribute set by considering only a subset of the attributes and 
quantizing some attributes into categories. Specifically, 
we used education (quantized to "no college", "some college", 
or "post-graduate degree"), marital status (quantized to 
"married" or "single/divorced/widowed"), gender (inherently 
categorized as "male" or "female"), and salary (inherently 
categorized as "over 50K" or "50K or less"), resulting in a total 
attribute set cardinality of $|\mathcal{X}||\mathcal{Y}| = 24$. We also discarded 
any individuals with missing information in any 
of these attributes, reducing the size of the total dataset to 
45222 individuals. In this and the remaining experiments, 
while varying the sampling parameter $m$ and overall privacy 
level $\epsilon$, we set the level of PRAM perturbation $\gamma$ as dictated 
by Equation (2). The results of the simulations with the UCI 
"Adult Data Set" are presented in Figure 5. The data points 
show the simulation results, with each point being an empirical 
estimate over 1000 independent experiments of the expected 
$\ell_2$-norm of the type error. The simulations were conducted for 
three privacy levels, $\epsilon = 0.1$, $0.5$, and $1.0$, and over a wide range 
of sampling parameters $m$ at each level. The corresponding 
theoretical utility bounds (see Equation (3) of Theorem 3.4) 
are illustrated by the solid curves, and the optimal number of 
samples (see Equation (5)) are shown with the vertical lines. 
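A single run of such a simulation can be sketched as follows. This is not the authors' code: it assumes attributes are encoded as integers in `0..q-1`, that the $\gamma$-diagonal matrix has diagonal entries $\gamma/(q-1+\gamma)$ and off-diagonal entries $1/(q-1+\gamma)$ (a definition given outside this section), and it uses the closed-form inverse of that matrix. The name `simulate_error` is illustrative.

```python
import math
import random


def simulate_error(data, q, m, eps, rng):
    """One run: sample m records without replacement, perturb them with PRAM,
    invert the PRAM matrix, and return the l2 error of the estimated type."""
    n = len(data)
    gamma = 1.0 + (n / m) * math.expm1(eps)       # Equation (2)
    keep = gamma / (q - 1 + gamma)                # assumed diagonal entry of A

    true_type = [0.0] * q
    for a in data:
        true_type[a] += 1.0 / n

    perturbed_type = [0.0] * q
    for a in rng.sample(data, m):                 # sampling without replacement
        if rng.random() >= keep:
            # move to one of the other q-1 symbols, uniformly at random
            b = rng.randrange(q - 1)
            a = b if b < a else b + 1
        perturbed_type[a] += 1.0 / m

    # Closed-form inverse of the assumed A applied to a vector t summing to one:
    # A^{-1} t = ((q - 1 + gamma) t - 1) / (gamma - 1).
    estimate = [((q - 1 + gamma) * t - 1.0) / (gamma - 1.0) for t in perturbed_type]
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, true_type)))


if __name__ == "__main__":
    rng = random.Random(0)
    q, n, eps = 24, 45222, 0.5
    data = [rng.randrange(q) for _ in range(n)]   # synthetic stand-in for the dataset
    for m in (200, 1000, 5000, 20000):
        runs = [simulate_error(data, q, m, eps, rng) for _ in range(20)]
        print(m, sum(runs) / len(runs))
```

Averaging over many runs per value of $m$, as in the experiments described above, gives an empirical estimate of the expected error that can be compared against the theoretical curve.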

We make the following observations. Firstly, we observe 
that the theoretical prediction of the optimal number of samples 
aligns well with the experimental results. In other words, 
the optimal sampling factor computed using the theoretical 
bounds is nearly identical to that obtained via experiment, 
for all privacy levels. Secondly, we find that the shape of 
the theoretical bounds is very similar to the shape formed 
by the experimental results; however, the theoretical bounds 
are off by about a factor of $\sqrt{|\mathcal{X}||\mathcal{Y}|}$. To verify this, note 
that these bounds, when divided by a factor of 
$\sqrt{|\mathcal{X}||\mathcal{Y}|}$ and plotted with the dashed lines, align well with 
the experimental results. We confirmed that this behavior is 
reproduced even when we change the cardinality of the data. 
We observed this behavior over various cardinalities ranging 
from 12 to 768, with 1000 independent experiments conducted 
at each cardinality. 

The looseness of the theoretical bounds can perhaps be 
explained by the bounding technique used in Equation (4) on 
the ratio of $\ell_2$-norms, 

which introduces a pessimistic factor of $\sqrt{|\mathcal{X}||\mathcal{Y}|}$ when 
bounding with the ratio of $\ell_1$-norms. This pessimistic bound is 
approached only if $\|T_{\tilde{X}^m,\tilde{Y}^m}\|_2$ is close to 1 (or, equivalently, 
$T_{\tilde{X}^m,\tilde{Y}^m}$ is close to a delta function) and $\|A\,T_{\tilde{X}^m,\tilde{Y}^m}\|_2$ is 
close to $1/\sqrt{|\mathcal{X}||\mathcal{Y}|}$ (or, equivalently, $A\,T_{\tilde{X}^m,\tilde{Y}^m}$ is close to 
uniform). Note that, while the gap can be made arbitrarily 
small, the bound cannot be met with exact equality due to 
the $\gamma$-diagonal structure of $A$ with $\gamma > 1$. However, when the 
type of the data $T_{\tilde{X}^m,\tilde{Y}^m}$ is (or is close to) uniform, the bound 
is loose, as the ratio of $\ell_2$-norms is equal (or close) to one. In 
our experiments, we have seen that the results with uniformly 
distributed synthetic data closely match those with the real 
data, and appear to either match or bound the utility results 
for the other synthetic distributions. If we tighten this bound 
by replacing the ratio of $\ell_2$-norms with one (assuming that this 
is a reasonable bounding approximation), the utility bound of 
Equation (3) becomes 

which reduces the overall error bound by roughly a factor of 
$\sqrt{|\mathcal{X}||\mathcal{Y}|}$, since the condition number $c$ typically dominates 
over one. 

Fig. 5. The experimental results from simulations with the UCI dataset 
are plotted as points alongside the theoretical results. The experiments were 
conducted for three privacy levels ($\epsilon$) and across a range of numbers of samples 
($m$). Each data point represents the expected $\ell_2$-norm of the type error 
estimated as the empirical mean over 1000 independent experiments. The 
solid curves illustrate the theoretical error bound, and the solid vertical lines 
illustrate the theoretically optimal number of samples at each privacy level. The 
dashed lines correspond to the error bound divided by a factor of $\sqrt{|\mathcal{X}||\mathcal{Y}|}$ 
to illustrate that the bounds seem to capture the correct shape, albeit being 
loose by a multiplicative factor. 
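The origin of the $\sqrt{|\mathcal{X}||\mathcal{Y}|}$ factor discussed above can be seen from elementary norm inequalities. The following is a sketch for a generic probability vector $p$ of length $q = |\mathcal{X}||\mathcal{Y}|$, assuming that $A$ is a column-stochastic PRAM matrix so that $Ap$ is again a probability vector; the symbols $p$ and $q$ are introduced here for illustration.

```latex
\begin{align*}
\frac{1}{\sqrt{q}} = \frac{\|p\|_1}{\sqrt{q}} &\le \|p\|_2 \le \|p\|_1 = 1,
  && \text{(any probability vector } p \in \mathbb{R}^q\text{)} \\
\frac{\|p\|_2}{\|Ap\|_2} &\le \frac{1}{1/\sqrt{q}} = \sqrt{q}.
  && \text{(since } Ap \text{ is also a probability vector)}
\end{align*}
```

The upper bound would require $\|p\|_2 = 1$ (a delta function) together with $\|Ap\|_2 = 1/\sqrt{q}$ (a uniform vector), whereas a uniform $p$ gives $Ap = p$ and a ratio of exactly one.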

Next, we conducted simulations with synthetically generated 
data. We generated synthetic datasets of the same length 
as the UCI dataset ($n = 45222$) and cardinality ($|\mathcal{X}||\mathcal{Y}| = 24$), 
but with three different distribution shapes: "uniform", "linear", 
and "peaky". The "uniform" dataset is simply uniformly 
distributed over the attribute set. The "linear" dataset has a type 
function that linearly increases from $(1/q)$ for the least frequent 
attribute to $(24/q)$ for the most frequent attribute, where 
$q := (1 + \cdots + 24)$ is a normalizing constant. In the "peaky" 
dataset, the most frequent attribute dominates the distribution 
at 90 percent, while the other attributes uniformly share the 
remaining 10 percent of the distribution mass. The experiments 
with the "uniform" and "linear" synthetic datasets produced 
results that were very similar to those with the UCI dataset. 
These results are plotted alongside the UCI dataset results in 
Figures 6 and 7, respectively. However, the experiments with 
the "peaky" synthetic dataset, presented in Figure 8, produced 
markedly different results than the UCI dataset experiments 
for lower values of $m$. We confirmed that this behavior is 
reproduced in experiments when the cardinality of the dataset 
is varied from 12 to 768. We conjecture that this is due to 
the high skewness of the "peaky" synthetic dataset, which 
effectively reduces the impact of the cardinality of the dataset, 
resulting in decreased error for lower values of $m$, the number 
of samples. 

Fig. 6. The experimental results from simulations with the UCI dataset 
are plotted alongside the results from simulations with synthetic data with 
a "uniform" distribution. The vertical lines illustrate the theoretically optimal 
number of samples at each privacy level. Each data point for both datasets 
was produced from 1000 independent experiments. 

Fig. 7. The experimental results from simulations with the UCI dataset 
are plotted alongside the results from simulations with synthetic data with a 
"linear" distribution. The vertical lines illustrate the theoretically optimal number 
of samples at each privacy level. Each data point for both datasets was 
produced from 1000 independent experiments. 

Fig. 8. The experimental results from simulations with the UCI dataset 
are plotted alongside the results from simulations with synthetic data with a 
"peaky" distribution. The vertical lines illustrate the theoretically optimal number 
of samples at each privacy level. Each data point for both datasets was 
produced from 1000 independent experiments. 

Fig. 9. The experimental results from simulations with the UCI dataset 
are plotted alongside the results from simulations with synthetic data. The 
number of samples for each pair of cardinality and privacy level is computed 
by Equation (6). Each data point for both datasets was produced from 1000 
independent experiments. 
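The three synthetic distribution shapes described in this section can be generated along the following lines. This is a minimal sketch: drawing records independently from each type vector is our assumption (the text does not say how the datasets were constructed), and the function names are illustrative.

```python
import random


def type_uniform(q):
    """Uniform distribution over q attribute values."""
    return [1.0 / q] * q


def type_linear(q):
    """Probabilities rising linearly from 1/total to q/total, total = 1 + ... + q."""
    total = q * (q + 1) // 2
    return [k / total for k in range(1, q + 1)]


def type_peaky(q, peak=0.9):
    """One dominant value with mass `peak`; the rest share 1 - peak uniformly."""
    rest = (1.0 - peak) / (q - 1)
    return [peak] + [rest] * (q - 1)


def draw_dataset(type_vec, n, rng):
    """Draw n attribute indices independently according to type_vec."""
    return rng.choices(range(len(type_vec)), weights=type_vec, k=n)


if __name__ == "__main__":
    rng = random.Random(0)
    for name, t in (("uniform", type_uniform(24)),
                    ("linear", type_linear(24)),
                    ("peaky", type_peaky(24))):
        data = draw_dataset(t, 45222, rng)
        print(name, len(data), round(data.count(0) / len(data), 4))
```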

Our experiments confirm that using the optimal number 
of samples ($m^*$) derived from the theoretical bound in Equation (6) 
consistently achieves near the minimum error in our 
experiments. This is observed in all experiments with different 
cardinalities, different data distributions, and different 
privacy levels. We plot the $\ell_2$-norm error in the estimated 
joint distribution for the optimal number of samples $m^*$ in 
Figure 9 for real and synthetic data experiments, at all three 
levels of privacy. The error curve of the UCI dataset overlaps 
with the error curve of the "linear" distribution and the error 
curve of the "uniform" distribution. The error of the "peaky" 
distribution is consistently lower than that of the other distributions. As 
mentioned above, we conjecture that this is due to the high 
skewness of this synthetic dataset, which effectively reduces 
the impact of the cardinality on the utility measure. 

V. Discussion 

We conclude our paper with a brief discussion to summarize 
our results and outline practical considerations toward imple- 
menting our proposed system. 

A. Summary of Results 

We analyzed a proposed system that combines sampling 
with PRAM to produce a privacy-preserving mechanism that 
enables data release for statistical analysis. The sampling stage 
has two benefits in the system: 1) it enhances the system 
privacy, improving the privacy-utility tradeoff, and 2) it reduces the 
cost of the one-time-pad encryption that provides strong security 
against a facilitating server. Sampling reduces the amount of 
PRAM noise needed to provide a desired level of privacy, but 
oversampling will actually degrade the estimation performance 
since too much noise is required to maintain privacy. Conversely, 
undersampling will also degrade estimation performance since 
less data is gathered. In this balance, there is an optimal 
sampling parameter, which we found in our analysis and 
confirmed in experiments with real and synthetic data. 

B. Practical Considerations 

The privacy-preserving framework described in this work is 
easy to implement in practice with very small modifications to 
the abstract setting of this paper. For instance, in the problem 
setting discussed above, encryption was accomplished by 
means of a one-time pad, which is an information-theoretic 
abstraction. Actually using one-time pads may be feasible if the 
sampling parameter is small enough to allow key distribution 
at a reasonable cost. However, a practical alternative would be 
to perform encryption with a conventional stream cipher, with 
the key provided to the curators and the authorized researcher 
but not to the server. From the perspective of the authorized 
researcher and the database respondents, the privacy-utility 
tradeoff remains the same. The only change is that the data 
released by the curators has computational privacy instead 
of information-theoretic privacy against the server. In other 
words, a computationally bounded server cannot recover the 
data sampled by the curators. 

Furthermore, several interesting variants of the proposed 
framework are possible owing to the fact that sampling, en- 
cryption and PRAM-based perturbation can commute without 
changing the privacy-utility tradeoff. The ordering of these 
operations is flexible allowing other architectures with the 
parties performing different roles. For instance, if the curators 
want a secure external database storage facility, then they could 
encrypt the full database with a stream cipher, and request that 
the server perform both sampling and PRAM. 

An important practical issue that has not been addressed 
in this work is the synchronization of the curators' databases 
and the sampling phase. In our development, it is assumed 
that the respondents in Alice's and Bob's databases are already 
synchronized and that they are able to sample in the same 
locations. A practical approach toward database synchronization 
could involve using secure hashes of the unique IDs 
associated with each record, if available. Synchronization of 
the sampling locations could be accomplished either by the 
curators directly sharing the sampling indices (using no more 
than $m \log n$ bits of communication) or by sharing the seed of 
a cryptographically secure pseudorandom number generator 
that drives the choice of the sampling locations. In the latter 
approach, the use of pseudorandomness would affect the 
statistical privacy guarantees against the researcher; however, 
the practical impact would likely be insignificant against a 
computationally bounded researcher. If the application allows 
for flexible architectures as described earlier, another alternative 
would be to have the sampling performed by the server. 
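The seed-sharing option can be sketched as follows. Python's `random.Random` is used only as a stand-in: it is not cryptographically secure, so a real deployment would need a cryptographically secure generator in its place. The function names and the seed value are hypothetical.

```python
import math
import random


def sampling_indices(seed: int, n: int, m: int):
    """Derive m distinct sampling locations in range(n) from a shared seed.

    random.Random is NOT cryptographically secure; it stands in for a CSPRNG here.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), m))


def index_sharing_bits(n: int, m: int) -> int:
    """Bits needed to send m indices directly, using ceil(log2 n) bits per index."""
    return m * math.ceil(math.log2(n))


if __name__ == "__main__":
    shared_seed = 123456789                      # agreed upon by both curators
    alice = sampling_indices(shared_seed, 45222, 1000)
    bob = sampling_indices(shared_seed, 45222, 1000)
    print(alice == bob, index_sharing_bits(45222, 1000))
```

Both curators obtain identical sampling locations from the same seed, so only the seed needs to be exchanged rather than the full list of indices.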
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